Daniel J. McDonald
Department of Statistics
University of British Columbia
26 October 2020
I can’t cover all of this! I’ll focus on forecasting philosophy and evaluation
Outline:
Implemented with
Reproducible talk:
I can’t cover all of this! I’ll focus on forecasting philosophy and evaluation
Outline:
Implemented with {evalcast} package + API
Reproducible talk: all code included
How many people have died from COVID-19 per day, in my state, since March 1?
library(covidcast)
deaths = covidcast_signal(data_source = "usa-facts",
signal = "deaths_7dav_incidence_num",
start_day = "2020-03-01",
end_day = "2020-10-15",
geo_type = "state", geo_values = "in")
plot(deaths, plot_type = "line",
title = "COVID-19 deaths in IN (7-day trailing average)")What percentage of daily hospital admissions are due to COVID-19 in IN, NY, TX, CA?
hosp = covidcast_signal(data_source = "hospital-admissions",
signal = "smoothed_adj_covid19",
start_day = "2020-03-01", end_day = "2020-10-15",
geo_type = "state",
geo_values = c("in", "ny", "tx", "ca"))
plot(hosp, plot_type = "line",
title = "% of hospital admissions due to COVID-19")What does the current COVID-19 incident case rate look like, nationwide?
cases = covidcast_signal(data_source = "usa-facts",
signal = "confirmed_7dav_incidence_prop",
start_day = "2020-10-21", end_day = "2020-10-21")
plot(cases) +
ggtitle("Daily new COVID-19 cases per 100,000 people",
subtitle = "Average for week ending 21 October")How does mask behavior relate to symptoms?
masks = covidcast_signal(data_source = "fb-survey",
signal = "smoothed_wearing_mask",
start_day = "2020-10-15", end_day = "2020-10-15",
geo_type="state") %>%
select(geo_value, value) %>% rename(masks=value)
cli = covidcast_signal(data_source = "fb-survey",
signal = "smoothed_hh_cmnty_cli",
start_day = "2020-10-15", end_day = "2020-10-15",
geo_type="state") %>%
select(geo_value, value) %>% rename(cli=value)
full_join(masks, cli) %>%
ggplot(aes(masks, cli)) + geom_text(aes(label=geo_value), color="#bb0000") +
theme_bw() + xlab("% wearing masks") + ylab("% COVID-like illness") +
ggtitle("Average for week ending 15 October")By default the API returns the most recent data for each time_value.
We also provide access to all previous versions of the data:
| Parameter | To get data … | Examples |
|---|---|---|
as_of |
as if we queried the API on a particular date | 20200406 |
issues |
published at a particular date or date range | 20200406 or 20200406-20200410 |
lag |
published a certain number of time units after events occured | 1 or 3 |
Why would we need this? Because many data sources are subject to revisions:
This presents a challenge to modelers: e.g., we have to learn how to forecast based on the data we’d have at the time, not updates that would arrive later
To accommodate, we log revisions even when the original data source does not!
The last two weeks of August in CA …
# Let's get the data that was available as of 09/22, for the end of August in CA
dv = covidcast_signal(data_source = "doctor-visits",
signal = "smoothed_adj_cli",
start_day = "2020-08-15", end_day = "2020-08-31",
geo_type = "state", geo_values = "ca",
as_of = "2020-09-21")
# Plot the time series curve
xlim = c(as.Date("2020-08-15"), as.Date("2020-09-21"))
ylim = c(3.83, 5.92)
ggplot(dv, aes(x = time_value, y = value)) +
geom_line() +
coord_cartesian(xlim = xlim, ylim = ylim) +
geom_vline(aes(xintercept = as.Date("2020-09-21")), lty = 2) +
labs(color = "as of", x = "Date", y = "% doctor's visits due to CLI in CA") +
theme_bw() + theme(legend.pos = "bottom")The last two weeks of August in CA …
# Now loop over a bunhch of "as of" dates, fetch data from the API for each one
as_ofs = seq(as.Date("2020-09-01"), as.Date("2020-09-21"), by = "week")
dv_as_of = map_dfr(as_ofs, function(as_of) {
covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli",
start_day = "2020-08-15", end_day = "2020-08-31",
geo_type = "state", geo_values = "ca", as_of = as_of)
})
# Plot the time series curve "as of" September 1
dv_as_of %>%
filter(issue == as.Date("2020-09-01")) %>%
ggplot(aes(x = time_value, y = value)) +
geom_line(aes(color = factor(issue))) +
geom_vline(aes(color = factor(issue), xintercept = issue), lty = 2) +
coord_cartesian(xlim = xlim, ylim = ylim) +
labs(color = "as of", x = "Date", y = "% doctor's visits due to CLI in CA") +
geom_line(data = dv, aes(x = time_value, y = value)) +
geom_vline(aes(xintercept = as.Date("2020-09-21")), lty = 2) +
theme_bw() + theme(legend.pos = "none")The last two weeks of August in CA …
dv_as_of %>%
ggplot(aes(x = time_value, y = value)) +
geom_line(aes(color = factor(issue))) +
geom_vline(aes(color = factor(issue), xintercept = issue), lty = 2) +
coord_cartesian(xlim = xlim, ylim = ylim) +
labs(color = "as of", x = "Date", y = "% doctor's visits due to CLI in CA") +
geom_line(data = dv, aes(x = time_value, y = value)) +
geom_vline(aes(xintercept = as.Date("2020-09-21")), lty = 2) +
theme_bw() + theme(legend.pos = "none")In backtesting, we should provide the forecaster the data that would have been available as of the forecast date. Otherwise, performance assessment may be naively optimistic.
Trained forecasters that do not account for backfill may learn to rely too heavily on recent data.
Evaluation relies on the “actual outcome”, but this might not be reliably known until some time has passed.
Note: evalcast has a backfill_buffer parameter that forces one to wait a certain amount of time before trying to evaluate.
Some data has other obvious issues …
riga = suppressMessages(covidcast_signal(
"usa-facts","deaths_incidence_num",geo_type = "state",
geo_values = c("ri","ga")))
plot(filter(riga, geo_value=="ga"),
plot_type="line", title="usa-facts: deaths_incidence_num --- GA")Some data has other obvious issues …
plot(filter(riga, geo_value=="ri"),
plot_type="line", title="usa-facts: deaths_incidence_num --- RI")source("correct-riga.R") # available in the Github repo,
# removed due to length
riga %>% filter(time_value > ymd("2020-04-01")) %>%
dplyr::select(geo_value, time_value, value, corrected, flag) %>%
pivot_longer(value:corrected) %>%
ggplot(aes(time_value)) + geom_line(aes(y=value,color=name)) +
geom_point(data = filter(riga, flag), aes(y=value), color="red") +
facet_wrap(~geo_value, scales = "free_y", ncol = 2) +
theme_bw() + xlab("date") +
theme(legend.position = "bottom",legend.title = element_blank())+
ylab(attributes(riga)$metadata$signal) +
scale_color_viridis_d()epiweekThe predictions.Rdata file was producet with ‘predcard-processing.R’
reds = RColorBrewer::brewer.pal(5,"Reds")[5:2]
ggplot(base_loc, aes(pred_date)) +
geom_point(aes(y=`0.5`,group=forecast_dates, color=as.factor(ahead))) +
scale_color_manual(values = reds) +
ylab("Deaths incidence FL") + xlab("date") +
geom_line(aes(y=`0.5`,group=forecast_dates)) +
geom_line(data=loc_truth, aes(date, lag_sum), color="blue") +
geom_point(data=loc_truth, aes(date, lag_sum), color="blue") +
geom_ribbon(aes(ymin=`0.1`, ymax=`0.9`,group=forecast_dates), alpha=.2) +
theme_bw() + theme(legend.position = "") +
coord_cartesian(xlim=ymd(c("2020-08-01","2020-11-15")), ylim=c(0,1600))ggplot(cmu_loc, aes(pred_date)) +
geom_point(aes(y=`0.5`,group=forecast_dates, color=as.factor(ahead))) +
scale_color_manual(values = reds) +
geom_line(aes(y=`0.5`,group=forecast_dates)) +
ylab("Deaths incidence FL") + xlab("date") +
geom_line(data=loc_truth, aes(date, lag_sum), color="blue") +
geom_point(data=loc_truth, aes(date, lag_sum), color="blue") +
geom_ribbon(aes(ymin=`0.1`, ymax=`0.9`,group=forecast_dates), alpha=.2) +
theme_bw() + theme(legend.position = "") +
coord_cartesian(xlim=ymd(c("2020-08-01","2020-11-15")), ylim=c(0,1600))ggplot(ens_loc, aes(pred_date)) +
geom_point(aes(y=`0.5`,group=forecast_dates, color=as.factor(ahead))) +
scale_color_manual(values = reds) +
geom_line(aes(y=`0.5`,group=forecast_dates)) +
ylab("Deaths incidence FL") + xlab("date") +
geom_line(data=loc_truth, aes(date, lag_sum), color="blue") +
geom_point(data=loc_truth, aes(date, lag_sum), color="blue") +
geom_ribbon(aes(ymin=`0.1`, ymax=`0.9`,group=forecast_dates), alpha=.2) +
theme_bw() + theme(legend.position = "") +
coord_cartesian(xlim=ymd(c("2020-08-01","2020-11-15")), ylim=c(0,1600))For more on weighted interval score, see, for example, Bracher et al. (2020).
ggplot(overall, aes(forecast_date, avg_relative_score, color=name)) +
geom_line(aes(linetype=ahead)) + geom_point() +
theme_bw()3 epiweeks ahead
A miscalibrated 80% interval
# Blue, red (similar to ggplot defaults)
ggplot_colors = c("#00AFBB", "#FC4E07")
x = seq(-3, 3, length = 1000)
y = dnorm(x)
q1 = qnorm(0.1); q1_hat = qnorm(0.07)
q2 = qnorm(0.9); q2_hat = qnorm(0.85)
par(mar = c(0,0,0,0))
plot(x, dnorm(x), type = "l", axes = FALSE)
abline(v = c(q1, q2), lwd = 2, lty = 2, col = "gray")
abline(v = c(q1_hat, q2_hat), lwd = 2, lty = 2, col = ggplot_colors)
polygon(c(x[x < q1_hat], max(x[x < q1_hat])),
c(y[x < q1_hat], min(y[x < q1_hat])),
col = adjustcolor(ggplot_colors[1], alpha.f = 0.5), border = NA)
polygon(c(min(x[x > q2_hat]), x[x > q2_hat]),
c(min(y[x > q2_hat]), y[x > q2_hat]),
col = adjustcolor(ggplot_colors[2], alpha.f = 0.5), border = NA)
text(min(x), max(y[x < q1_hat]), labels = "0.07 (should be 0.1)", pos = 4)
text(max(x), max(y[x < q1_hat]), labels = "0.15 (should be 0.1)", pos = 2)Image credit: Ryan Tibshirani
scorecards = subsetted
plot_width("bug here") + scale_y_log10() + theme(legend.pos = "bottom") +
theme_bw()Delphi’s COVIDcast ecosystem has many parts:
In this pandemic, it’ll take an entire community to find answers to all the important questions.
We have piles of projects/issues/questions. See CMU-Delphi if you have spare cycles.